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Abstract 

When faced with the task of forecasting a dynamic system, practitioners often have available his- 
torical data, knowledge of the system, or a combination of both. While intuition dictates that perfect 
knowledge of the system should in theory yield perfect forecasting, often knowledge of the system is 
only partially known, known up to parameters, or known incorrectly. In contrast, forecasting using pre- 
vious data without any process knowledge might result in accurate prediction for simple systems, but will 
fail for highly nonlinear and chaotic systems. In this paper, the authors demonstrate how even in chaotic 
systems, forecasting with historical data is preferable to using process knowledge if this knowledge 
exhibits certain forms of misspecification. Through an extensive simulation study, a range of misspec- 
ification and forecasting scenarios are examined with the goal of gaining an improved understanding 
of the circumstances under which forecasting from historical data is to be preferred over using process 
knowledge. 

1 Introduction 

Whether it be predicting traffic flows next week, the longevity of nuclear stockpiles, or the climate next 
century, forecasting on all scales has become a crucial part of modern society. How this forecasting of 
dynamic systems is performed, however, varies drastically. A city engineer may use historical traffic volume 
data to predict upcoming flow; a nuclear scientist may use complex multi-physics based models to predict 
the lifespan of existing nuclear stockpiles; a climate scientist may use both complex physics-based models 
and historical climate data to predict long-range climate forecasts. Typically, the choice of forecasting 
method is a function of the available resources, as well as the demands of the forecast itself. Using the 
earlier examples, perhaps the city engineer is unaware of traffic-flow models, yet has considerable data on 
historic traffic flows; because of the United States' ban on nuclear testing, a nuclear engineer is faced with 
lack of data, and hence must rely on complex deterministic models to predict stockpile lifetime; in contrast, 
the climate scientist has available a wealth of physics-based models which describe our world, as well as 
data taken from historical records and inferred from ice cores and other methods. 

What approach should be preferred? If one has available perfect understanding of the underlying system 
(including noise mechanisms), then in theory perfect predictions are available. Additionally, as one gains 
more and more data from a stationary system, they should be able to leverage this data to predict arbitrarily 
well (precluding the presence of noise). However, it is less clear in general how these two choices break 
down as system-knowledge is misspecified or historical data becomes limited. Highly nonlinear and chaotic 
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(including meteorological) systems will exacerbate any incorrect knowledge, to the point of long-range 
forecasts becoming nearly useless. It is therefore interesting to explore the effects of inaccurate information 
in the context of chaotic systems. 

Knowledge-based approaches to forecasting are particularly prominent in fields connected to the natural 
sciences. For instance, these models are used to forecast chemical reactions ( iflOll ). determine strengths 
of engineering structures ([12]), and price options in financial markets (0). Such models rely on a set 
of physical relationships expressed through mathematical formalism, usually differential equations. These 
equations may be solved in closed form or through approximation methods; one such method is finite- 
element modeling, which reduces a system of partial differential equations to an approximate system of 
ordinary differential equations which may be solved with Euler's Method or related techniques. 

In contrast to knowledge-based approaches, data-based approaches to forecasting rely solely on histori- 
cal data. For example, a retail outlet operator might predict next year's profits from previous years' records. 
As another example, each year the Prognostics and Health Management Society holds a challenge contest, 
whereby participants are given a data file with only rough details of the data's origin. From this, a model 
must be built which accurately predict a (withheld) testing set. The participant who is best able to predict 
the testing set is awarded a cash prize and given a special invited session at the society's conference. In 
situations where only data are available, one may select somewhat arbitrary parametric models (for instance 
linear regression or an autoregressive model) to reduce the dimensionality of the problem, or alternatively 
use non-parametric models such as support vector regression which make fewer assumptions on the distri- 
bution and relationships within the data. 

While we have discussed only the previous two types of forecasting, it is worth noting that others exist. 
For example, predictions of Oscar winners in a given year are a combination of data (the winners of the 
Golden Globes and other awards ceremonies) as well as expert knowledge of the films. However, for the 
purpose of this exposition, we wish to compare knowledge and data-based forecasting methods in the forms 
described above. Specifically, we will explore both of these two forms of forecasting under various levels of 
misspecification, testing when each method breaks down. 

The organization of the paper is as follows: Section 2 discusses in detail prediction from historical data, 
focusing on Support Vector Machine (SVM) forecasters - a non-parametric tool for predicting from data. 
We move to knowledge-based approaches in Section 3, in particular focusing on filtering as an optimal tool 
for removing noise from dynamic systems. The effects of misspecification are studied in Section 4, where an 
extensive simulation study is described. Section 5 concludes the work with a description of lessons learned. 

2 Forecasting with Historical Data: SVM Forecasters 

Given a set of historical data, there is a potentially infinite range of possible models which may be used 
to forecast future data. While the choice of model will be largely situation-dependent, such models may 
generally be classified as parametric or non-parametric. The former relies on describing the data through a 
set of relationships determined by a set of parameters. Such models include linear regression, generalized 
linear models, and autoregressive models. In contrast, nonparametric models use historical data directly, 
comparing it with future data through some metric from which a model may be built. This class includes 
neural networks and SVMs, the latter of which we now focus on in more detail. 

Specifically, we begin by describing least squares support vector machines (LS-SVMs). To this end, let 
L : lxH-> [0, oo) be the least squares loss defined by L(y,t) := (y—t) 2 ,y,t £ R. Moreover, let X C R d 
be a non-empty subset and k : X xX — )• Rbe a kernel, i.e. for every finite sequence {x\, X2, • • • , x n ) G X n , 
n > 1, the kernel matrix K := (k(xi, - =1 is positive semi-definite. A particular choice, which we will 
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consider throughout this work, is the Gaussian RBF kernel k a : H d x R d — > [0, oo) defined by 

k a (x, x) := exp^— a 2 \\x — x'W^j , x, x G R d , 

where || • H2 denotes the Euclidean norm and a > is a free parameter called the width. It is well-known 
(HI El), that to every kernel on X there is a uniquely determined reproducing kernel Hilbert space (RKHS) 
H on X, i.e., a Hilbert space consisting of functions that map from X to R such that k(x, • ) G H and 

f(x) = (f,k(x, •)> 

for all x e X and f £ H. 

Let us now suppose that we have a finite sequence D := ((xi, yi), . . . , (x n , y n )) £ (Ix R) n of labeled 
samples. Then the LS-SVM finds the unique solution fr>,\ G H of the convex optimization problem 




where A > is a regularization parameter and H is a reproducing kernel Hilbert space over X. This 
approach has been widely considered in the literature, we refer to lflTl[T6l l3l for the above formulation, and 
to lfl4ll for a formulation with an additional offset. In particular, it is well known that the solution fjj t \ G H 
is given by 

n 

Id,x = } j ajk(xi, ■ ), 
i=i 

where the vector a := (ax, . . . , a n ) G HI™ is given by 

a = 2\{K + \I)- 1 y, (2) 

where K denotes the kernel matrix that corresponds to the samples (x\, . . . , x n ), I is n-dimensional diago- 
nal matrix, and y := (yi, . . . , y n ) G R n denotes the vector of labels. Unfortunately, computing a directly 
via ([2]) requires a matrix inversion, which is an 0(n 3 ) operation and hence infeasible for larger datasets. 
On the other hand, [7] proposed a optimization algorithm that finds a by maximizing the dual optimization 
problem of ([T]) and that is usually computationally more efficient than the direct method. 



3 Forecasting with Process Knowledge: Filtering 

As discussed earlier, knowledge of the underlying process in the form of differential equations is often used 
to model complex systems. In addition, one typically has a training set to initialize the model. For instance, 
meteorological forecasts are continually tuned and adjusted given current measurements. When such noisy 
measurements of the system are available, the removal of noise to find the true underlying state is known as 
data assimilation. In the context of linear dynamic systems with Gaussian measurement noise, the standard 
choice is the Kalman filter. For more general classes of dynamic systems, more general filtering techniques 
exist, as we describe now. 

Whereas in the S VM forecaster framework data alone is used to model the system and hence provide pre- 
dictions, filtering relies on knowledge of the process underlying the system as well as noisy measurements 
of this process. Specifically, we assume that the system dynamics are known up to some parameter(s). The 
underlying state-space model may be written as 

Zt\Zt~\ ~ Pz,t(z\z t -l) 
Xt\z t ~ p x j(x\zt) 
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where Zt and xt denote the unobserved state and observation at time t, respectively; p Z}t and p x j are the 
state transition and measurement models, respectively. Also, we assume a prior distribution p(zo) on zq. In 
the case of linearly additive noise, we may write this state-space model as 



z t = f(z t - 1 \e) + r lt 
x t = h(zt) + e t . 



(3) 



Here both the stochastic noise r/ t and the measurement noise et are mutually independent and identically 
distributed sequences with known density functions. In addition, f(z t -i\6) and h(z t ) are known functions 
up to some parameters 6. 

In order to build the framework on which to describe the filtering methodologies employed, we first 
frame the above state-space model as a recursive Bayesian estimation problem. Specifically, we are inter- 
ested in obtaining the posterior distribution 



where zq-± = {zq, z\, ... , z t } and x\-t = {x\, x%, . . . , Xt}. Often we don't require the entire posterior 
distribution, but merely one of its marginals. For instance, we are often interested in the estimate of state 
given all observations up to that point; we call this distribution the filtering density and denote it as 



By knowing this density, we are able to make estimates about the system's state, including measures of 
uncertainty such as confidence intervals. Also, once an estimate of the state is obtained, the system may 
be propagated forward using / to obtain forecasts. If the functions / and h are linear and both rjt and e t 
are Gaussian, Kalman filtering is able to obtain the filtering distribution in analytic form. In fact it can be 
seen that all of the distributions of interest are Gaussian with means and covariances that can be simply 
calculated. However, when the dynamics are non-linear or the noise non-Gaussian, the Kalman filter is only 
able to provide a linear approximation, hence we must turn to alternative methods. 

In the case of non-linear dynamics with Gaussian noise, the standard methodology is the extended 
Kalman filter, which may be considered as a nonlinear Kalman filter which linearizes around the current 
mean and covariance. However, as a result of this linearization, the filter may diverge if the initial state 
estimate is wrong or the process is incorrectly modeled. In addition, the calculation of the Jacobian in the 
extended Kalman filter can become tedious in high-dimension problems. Because of this we turn to the 
unscented Kalman filter of [17], which approximates the nonlinearity by transforming a random variable 
instead of through a Taylor expansion, as the extended Kalman filter does. The method uses a set of strate- 
gically chosen sample points which are evolved using the exact nonlinear dynamics, yet still capture the 
posterior mean and covariance accurate to the second order. By employing a deterministic sampling tech- 
nique known as the unscented transform ([6]), UKF selects a minimal set of sample points around the mean 
which are then propagated through the non-linear functions while recovering the covariance matrix. 

When either the stochastic or measurement noise is non-Gaussian, Monte Carlo methods must be em- 
ployed, in particular particle filters ([5]). This Monte Carlo based filtering method relies on a large set of 
samples, called particles, which are evolved through the system dynamics with potentially non-Gaussian 
noise using importance sampling and bootstrap techniques. At each time step the empirical distribution of 
these particles is used to approximate the distribution of interest and its associated features. By sampling 
from some proposal distribution q{zQ._t\x\-_t) in order to approximate (j4j), we may use importance sampling 
with corresponding unnormalized weights 



P{ZQ:t\x\-.t) 



(4) 



p(z t \xi:t). 



(5) 



w t = 



P{xi:t\ZQ:t)P{zQ:t) 
q{zO:t\X\:t) 
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However, we typically wish to perform this estimation sequentially, and hence we can take advantage of 
the Markov nature of the state and measurement process along with proposal distributions of the form 
q(zo:t\%l:t) = q{zo-t-i\xi : t-i)q{zt\zQ-t-i,x\ : t). From this we obtain the recursive weight formula 

P(xt\zt)P{zt\zt-i) 

W t = W t -1 -—, r— . 

q(Zt\ZO:t-l,Xl-t) 

This equation allows for the sequential updating of importance weights given an appropriate choice of pro- 
posal distribution q(zt\zo-.t-i, xv.t), as well as simple calculation of the filtering density Since we can 
sample from this proposal distribution and evaluate the likelihood and transition probabilities, the particle 
filter simply involves generating a prior set of samples, evolving these samples forward with the proposal 
distribution, and subsequently calculating the importance weights. In addition, to prevent particle degener- 
acy, we employ a resampling step to remove particles with low weight and multiply those with high weight 
(11). 

Often the problem of filtering isn't restricted to the estimation of state, but is also concerned with esti- 
mating some parameters 9 of the dynamic model f(z\9). Further complicating matters, the only information 
we have about the state and the model parameters is the noisy measurements {xt} t>1 . While there are sev- 
eral approaches for solving this problem, we focus on dual estimation, namely the use of parallel filters to 
estimate state and model parameters ( lfT8l ). Specifically, we use a state-space representation for both the 
state and parameter estimate problems. While the state-space representation for the state is given in equation 
Q, the representation of the model parameters is given by 

9 t = 6 t -i + vt 

x t = f{zt-\\9 t ) +Vt + e t - 

Here rjt and et are as in Q, while vt is an additional iid noise term. Thus one can run two parallel UKF 
filters for both state and parameters. At each time step the current state estimate is used in the parameter 
filter and the current parameter estimate is used in the state filter. Many more details on particle filtering 
may be found in the edited volume of lH. 

4 Examining the Effects of Misspecification 

For the purpose of this article, we define misspecification broadly, including what may traditionally be 
termed "underspecification." Firstly, mathematical models are often incapable of capturing the complete 
complexity of physical systems, and hence are often only an approximation. Hence the difference between 
the mathematical model and underlying physical system might be considered misspecification, as might the 
use of incorrect parameters in these models. Alternatively one might have to attempt to learn the parameters 
from available data. For data-based models, misspecification can be understand in terms of lack of avail- 
ability of data. While a small, stationary system may be fully understood from a handful of data points, 
complex and chaotic systems might require considerable amounts of data just to gain a rough understanding 
of the relationships between variables. 

We now proceed to conduct an elaborate simulation study to gain further insight into the mechanisms by 
which misspecification causes these forecasting methods to break down. Specifically, we look at data-based 
forecasting methods for range of amounts of historical data and training lengths, studying their prediction 
performance in both the immediate and distant future. Similarly for knowledge-based forecasting methods, 
we explore misspecification in terms of unavailable training data, unknown parameters, and unknown noise 
structure. We look at SVM forecasters and filtering, both discussed earlier and in the appendix, as our 
primary focus of study. As will be seen, many of the lessons learned from these models may be generalized 
to their respective classes of forecasters. 
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Historical Data Size 


Repetitions 


500 


100 


1000 


100 


2000 


10 


4000 


10 


8000 


5 


16000 


5 



Table 1 : For each historical data set size (first column) we repeat the forecasting experiment a number of 
times (second column); the forecasting performance is averaged for each size of the historical data set. 



To really stretch the limits of the forecasting methods, we test them on a chaotic system, specifically the 
Lorenz-3 system ([9]). The 3-dimensional system, parameterized by the variables a, b and r is described by 
the following differential equations 

dx , . dy . . dz 

— = a(y — x), — = x(r — z) — y, — = xy — bz 

dt {y " dt v ' dt y 

and for our purposes is truncated using Runge-Kutta. From this underlying system, we add both system and 
observation noise (described in more detail later). 



4.1 Experimental Design 

We now describe the setup of the numerical experiments performed to compare the SVM forecaster and 
filtering approaches to forecasting. All generated data is from the fore-mentioned Lorenz-3 system, with 
added noise as described later. The two forecasters - SVM and filtering - each have access to a noisy se- 
quence of T p consecutive past observations, X := (x-t p +i, ■ ■ ■ > ^o)> called the training set. The goal of the 
forecaster is to predict the state xt s of the dynamical system at future time Tj. The prior information avail- 
able to the forecasters is different. The filter will have partial or complete knowledge about the dynamical 
system and the noise process, while the SVM forecaster will have no information about the dynamics or the 
observational noise process: instead, this forecaster uses additional past noisy observations (in addition to 
the training set X). Thus the SVM forecast is entirely a data-based model of the dynamics. 

More specifically, the SVM forecaster has available to it a set of historical data used to "learn" the 
system. We explore a wide range of lengths of historical data, ranging from 500 to 16, 000 in two-fold 
increments. For each of these lengths, we repeat the experiment several times and use the average perfor- 
mance. Due to the computational intensity of training the SVM forecaster, we limit the number of repetitions 
for the larger historical data sets as shown in Table [T] As for the filter, we explore 2 filters with different 
observation noise structures - Gaussian and Laplace, to test the sensitivity of forecasting to the specified 
noise structure. The Gaussian filter is fit using the UKF, while the Laplace filter is fit using particle filtering. 
The general setup of the experiment is as follows: 

1. Generate a long trajectory of the dynamical system. 

2. Use a portion of the data to generate the historical data sets (see Table[T]) and build the SVM forecasters 
for each. 

3. Generate n = 1000 random indices i that describe a sample (X, xx f ) in the sense that i points to the 
xq state. 
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System 


a 


b 


r 


Stochastic 
Noise 


Observation 
Noise 


Filter Knowledge 


DS1 


10.0 


8.0/3.0 


28.0 


None 


iV(0,0.80) 


All Known 


DS2 


8.13 


0.53 


35.23 


None 


iV(0,1.13) 


All Unknown 


DS3 


12.18 


0.52 


13.14 


None 


J7([-0.5,0.5]) 


All Unknown 


DS4 


4.82 


0.63 


20.09 


None 


.5V(0.1,0.25) 


All Unknown 












+.57V(-0.1,0.5) 




DS5 


3.34 


0.54 


23.49 


None 


.8N (0.1, 0.25) 


All Unknown 












+.2C/([-0.1,0.5]) 




DS6 


9.57 


3.04 


27.32 


7V(0,0.1) 


0.5{exp(l)/4} 


All Unknown 












+.5 * {-exp(l)/4} 





Table 2: Description of the dynamic systems used in the forecasting experiments. Except for DS1, where 
both the Lorenz parameters and the observational noise parameters are known, the system and noise pa- 
rameters are not known to the forecaster. Here, N(m,a) denotes a Gaussian distribution with mean m and 
standard deviation a, U([a, b]) is uniform noise in the range [a, b], exp(A) denotes the exponential distribu- 
tion of parameter A, while p\N(m\,cji) + p2N(m,2, 02) denotes a mixture of Gaussian distributions where 
pi is the probability of sampling from N(mi, 01) and P2 is the probability of sampling from N(rri2, 02); a 
similar interpretation holds for the other mixture distributions in the table. 



4. For each index % representing a sample (X,xr f ), compute the predictor f(X) for each model and 
save it. For each index i we examine a range of past training set lengths T p = 5, 10, 20, 50, 100, 1000 
and future forecasting lengths T f = 1, 5, 10, 20, 30, 40, 50. 

5. For the SVM forecaster, the above is repeated for the different historical data sets (including replica- 
tion) and the results averaged for each set. 

As discussed earlier, we explore a range of misspecification scenarios, starting from fully specified 
through to drastically misspecified. Table [2] describes the 6 systems under consideration in detail. We see 
from this table that DS 1 is quite favorable to the filter, as all aspects of the system are known. DS2 through 
DS6 contain various levels of noise, and the filter attempts to learn the system in each case. Worthy of 
note is DS6, which in addition to observation noise contains stochastic noise. Details of the tuning of each 
forecasting method are deferred to the appendix. 

4.2 Results of Experiment 

We now proceed to discuss the results of the experiment in order of the dynamic systems studied. For 
each system we plot the 1, 10, 30, and 50 step ahead forecasting RMSE. Due to space constraints and the 
similarities between DS3, DS4, and DS5, we exclude the plots of DS4 and DS5. Because the maximum 
embedding length of the SVM forecaster is set to 20, we indicate performance for larger training sizes with 
a dashed line, as all training data in addition to the first 20 is not used by the forecaster. 

We first look at DS1 (Figure [TJ, in which the filter has complete knowledge of the system, and hence 
the filter can be built with these quantities fixed. Since the noise and parameters are fixed in the filter, all 
that is needed is time to locate the state. As such, we see slight improvements in prediction performance 
with increases in embedding length, as the state is able to be located more accurately and precisely. As 
expected, we find the RMSE of the filtering method to dominate the SVM forecaster, with the Gaussian filter 
(which uses the correctly specified noise) outperforming the Laplace filter, which has incorrectly specified 
noise structure. The improvement of using knowledge via the filter in this case is more pronounced for 
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larger forecasting lengths, as the SVM forecaster is not able to accurately capture dynamics of this range. 
Interestingly, the SVM forecaster seems to do no better with additional training data (5, 10, or 20). Where 
significant gains are made by this forecaster is with larger historical data sets. 

Looking at DS2 (Figure|2]), we see some interesting features of the forecasting performance. Specifically, 
because we initialize the state at the first observation, the filter does not have much time to diverge with 
small training sets, and hence for small training sets and short forecasting lengths, performance is excellent. 
However, as more training data is observed and the model attempts to learn the system parameters, the 
estimate of state is compromised, resulting in worsened forecasts. However, as we see 600+ training data, 
both state and parameters are found, and hence prediction capabilities equal or surpass the SVM forecaster. 
Additionally, note that the Laplace filter (fit using particle filtering) outperforms the Gaussian. This is likely 
due to the larger system variance, for which the Laplace noise is able to sufficiently accommodate, whereas 
the specified standard deviation of the Gaussian filter is smaller than the actual observation noise (0.8 vs. 
1.13) and hence outlying points are too influential. Similar results are observed for DS3 through DS5 (DS3 
shown in Figure [3]), however the additional noise misspecification results in the SVM forecasters with long 
historical data sets to uniformly dominate the filters. 

As we add in stochastic noise in DS6 (Figure [4]), we observe very interesting features. Firstly, the 
SVM forecaster becomes worse with increased training set length for short-range forecasts. Because of the 
relative smoothness of the system, this indicates that even with stochastic noise the system has little chance 
to deviate significantly. While it is initially quite curious that the filters outperform the SVM forecasters for 
small training set, this is in fact due to the initialization of the filters. Specifically, the prior on the parameter 
is set at the values for DS1, specifically 10.0, 8.0/3.0, and 28.0, whereas the DS6 system parameters are 
9.57, 3.04, and 27.32 which is remarkably close. Thus for short forecasting lengths, the prior is quite 
accurate, and hence forecasts are quite good. However, as more training data is observed, the filters diverge 
from the true parameters and state, and hence we see forecasting degrades to roughly the average of the SVM 
forecasters. Thus in this example, we see that barring lucky initialization, SVM forecasters are preferable if 
sufficient historical data is available. However, because of the limited information in the historic data due to 
the presence of stochastic noise, the filter may outperform the SVM for moderate historical data. 

It is evident from the above results that the time to convergence of both the parameter and state depend 
on the initial conditions. As such, we conduct a sub-experiment to test model parameter convergence in 
the Lorenz system. Specifically, we use DS 1 with known noise, and test different initial conditions on the 
parameters. We add multiples of the vector [oo, ro, bo] to the true (known) parameters [10, 28, 8/3] as our 
initial condition. Here o"o = ±2, ro = ±3, 6o = ±1, each with probability 1/2. Here we use five different 
levels of initial conditions corresponding to multiplying the above vector by one through five. By plotting 
the average MSE over the three parameters we obtain Figure 5(a) From this we see that for initial conditions 
close to the true value, convergence is obtained nearly monotonically. However, an interesting trend is seen 
for less accurate initial conditions. Specifically, initially the convergence looks good, then around time 200 
the MSE begins to climb. However, this is not a consistent trend in all of the repetitions. Rather, this 
phenomenon is created by a small number of repetitions resulting in parameters which don't converge. An 
example of converging (and not converging) parameters is shown in Figure 5(b) Because of the chaotic 
nature of the Lorenz system, the parameters often end up on the wrong attractor when poorly initialized, 
and as a result the parameter estimates do not converge to the true value. In fact, while most repetitions 
resulted in convergence of the model parameters, several resulted in similar results as Figure 5(b) with the 
two variables (blue and green, specify here) staying constant, and the red variable (specify) oscillating in a 
sinusoidal pattern, indicating that the filter has fallen on a different attractor with similar location as the true 
solution. 
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Figure 1: Root mean squared error of forecasting methods from DS1 vs. length of training set. Four panes 
represent forecasting 1, 10, 30, and 50 steps ahead. 
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Figure 2: Root mean squared error of forecasting methods from DS2 vs. length of training set. Four panes 
represent forecasting 1, 10, 30, and 50 steps ahead. 
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Figure 3: Root mean squared error of forecasting methods from DS3 vs. length of training set. Four panes 
represent forecasting 1, 10, 30, and 50 steps ahead. 
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Training Set Length Training Set Length 



Figure 4: Root mean squared error of forecasting methods from DS6 vs. length of training set. Four panes 
represent forecasting 1, 10, 30, and 50 steps ahead. 



12 



— +/-[10,15,5] 





+/-[8,12,4] 




+'-[6,9,3] 


-2 


+/-[4,6,2] 




+/-[2,3,1] 


-4 




-6 




100 200 300 400 500 600 700 

Iteration 



900 1000 




100 200 300 400 500 

Iteration 



700 800 900 1000 



(a) Model parameter convergence for different initial conditions (b) Example of model parameter convergence and divergence 

Figure 5: Model parameter convergence 



5 Conclusion 

In this paper, we have explored the effects of misspecification on forecasting methods, where misspecifica- 
tion is defined to include all factors leading to incomplete knowledge of the system at hand, either through 
lack of knowledge of the system structure, or lack of historical data to learn the system. We have explored 
the effects of misspecification on both data and knowledge -based forecasting methods, demonstrating the 
importance of initialization and adequate levels of historical data. We have observed that in knowledge- 
based models where parameters must be learned, care must be taken to ensure sufficient training data is 
used to ensure accurate parameter estimation. As the knowledge-based forecaster takes on more aspects of 
misspecification, the data-based forecaster will outperform it given adequate data. 

Through an extensive simulation study, we have demonstrated these notions empirically. Using the 
Lorenz-3 dynamic system as an example, we have explored filtering - a knowledge -based forecasting 
method, and SVM forecasters - a data-based forecasting method. When the filter is provided with com- 
plete and accurate information, it dominates the data-based method. However, when the parameters must 
be learned and noise is misspecified, the SVM forecaster outperforms the filter. When stochastic noise is 
introduced, training data has little positive effect on the forecasters, as there is so little information on the 
system parameters in the data itself. If parameters are initialized correctly, convergence and therefore ac- 
curate forecasting is likely. However, when initialized poorly, we have demonstrated that in such chaotic 
systems as Lorenz-3, the parameters might converge to another mode in the posterior surface of parameters, 
and hence prediction (particularly long range) will be severely impacted. 
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6 Appendix 



6.1 Appendix 1: Experimental Details of SVM Forecaster 

Let us now describe our experimental setup for the SVM forecaster. Using an LS-S VM with Gaussian kernel 
we need to determine both the regularization parameter A and the kernel width a. To this end, we first scaled 
the input dimensions, i.e., the past observations, of the training set to the interval [—1,1] and saved the 
corresponding affine linear transformation so that it can later be applied to the test set. We further split the 
training set into four equal parts to perform 4-fold cross validation, and considered a grid of candidate values 
for both A and a, where for A we picked a grid with endpoints A m i n := 10(3n/4)~ 2 and A max '■= 1.0, and 
for a we selected a grid with endpoints c m i n := 0.1 and er max := 2.0(3ra/4) 1// ( 3M ), where n is the number 
of samples in the training set and M is the length of the considered history. Here we note that that the 
behaviour of these endpoints in n and M are suggested by theoretical results on support vector machines in 
the i.i.d. case, see lfl3l Chapter 7.4]. Though strictly speaking these results do not apply to our datasets, we 
found that they were actually a good choice in all cases. In addition, note that the factor 3/4 was used to 
address the fact that in the 4-fold cross validation procedure the actual number of samples used for training 
was approximately 3/4n. In both cases, the grid contained 10 geometrically distributed points. We then 
performed 4-fold cross validation to determine the parameter pair (A*, a*) from the grids that had the best 
cross-validation performance. In other words, for each parameter pair we used three fourths of the training 
set for computing the LS-SVM and one fourth for computing the validation error and repeated this four 
times with changing validation set. The pair (A*, a*) with the smallest average validation error was then 
picked. 

This cross-validation procedure was repeated for all embedding dimensions M = 1, . . . , 50, where for 
M > 5 we stopped considering higher embedding dimensions if the best cross-validation performance of 
the current embedding dimension was 1.02 times larger than the best cross-validation performance of all 
previously considered embedding dimensions. We then picked the embedding dimensions M* that had the 
best cross-validation performance over the ranges of allowed embedding dimensions. For example, if we 
were allowed to use up to 5 embedding dimensions, we picked the M* £ {1, 2, 3, 4, 5} with the best cross 
validation performance. Once we had determined the embedding dimensions M* and the hyper-parameter 
pair (A*, a*) we then computed the LS-SVM on the entire training set for (4/3A*, a*), where we note that 
this sort retraining is the standard procedure in machine learning and we had made very good experience with 
this particualr rescaling of the parameters in many previous experiments with SVMs on other datasets. For 
this LS-SVM we then computed the test error with the help of the hold out test set whose input dimensions 
where scaled by the saved affine linear transformation we found on the training set. 

6.2 Appendix 2: Experimental Details of Filtering Methods 

In each of the Gaussian filters for DS1 through DS6, standard deviation 0.8 is used, and the UKF is used 
to fit the model. We also employ the particle filter using Laplace noise e t and setting P(zq) ~ N(yi, So), 
where So is chosen to be a diagonal matrix with all elements equal 2. 

The choice of proposal distribution has a significant effect on the rate of degeneracy. The standard (and 
simplest) choice is the prior distribution q(zt\zo-.t-i,%i:t) = P(zt\%t-i) since the weights simplify to a 
calculation of the likelihood. However, if the likelihood is not near the prior, this choice will lead to large 
variance in the importance weights, and hence we would like to employ a proposal distribution which uses 
the data to provide a better estimate of the posterior distribution. One such possibility is to use a Gaussian 
approximation of the posterior as the proposal distribution. We employ the UKF as previously developed to 
create a proposal distribution for each particle as in [fT5l. Thus we run the UKF algorithm, but in addition 
carry a set of particles which are reweighted and resampled based on the UKF foundation. The situation is 
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complicated in the particle filter situation, due to the well-known problem of degenerate weights ([5]). As 
a solution to this problem, we employ a Gaussian approximation using the UKF for the parameter filter in 
that case. In addition, to encourage parameter convergence we employ a tempering effect on v t to shrink it 
towards zero as a function of time ([8]). 

Once the end of the observed data {yt\ t >\ * s reached, we end up with a current estimate of state (and 
perhaps model parameters). With forecasting in mind, we then use this estimate of state and model pa- 
rameters (zt,0t), combined with the state evolution equation f(x\9) to propagate the system foreword. 
Specifically, the estimate at one time step past the final observation xt is zr+i = I{zt\0t), at two time 
steps past is zt+2 = f{zT+i\@o), and so forth. 
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